This analysis provides insight into car crashes in New York from 2016 to 2022, focusing on key factors such as vehicle type, hour of the day, and weather.
We work with two datasets: one containing the vehicle crashes and one describing the weather at the location.
| Name | Rows | Columns | Each row is a | Link |
|---|---|---|---|---|
| Vehicles dataset | 2.11M | 29 | Motor Vehicle Collision | Vehicles dataset |
| Weather dataset | 59,760 | 10 | Time stamp of Weather | Weather dataset |
In this step we collect the data from both datasets, clean it, and merge it into a single comprehensive dataset for analysis. Before merging, the data needs to be cleaned, enriched with additional information, and reduced in size to make it more manageable.
The data cleaning process involves removing columns with high NA ratios, filtering out rows with missing values, and creating new columns to categorize the main causes of accidents. The data is then enriched with additional information such as the day of the week, month, quarter, year, and time of day.
Both datasets, especially vehicles.csv, contain a large number of
rows, which require a lot of memory and time to process. For this reason,
we remove rows with missing values and columns with high NA
ratios, such as vehicle type 3, 4 and 5, which are only filled in for
accidents involving several vehicles and therefore hold very few values.
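The cleaning and enrichment steps described above can be sketched in base R as follows. This is a minimal illustration and not the exact code used for the report: the toy data, the column names (`crash_datetime`, `vehicle_type_1`, ...) and the 50% NA threshold are all assumptions.

```r
# Toy stand-in for the crash data; column names and values are illustrative.
crashes <- data.frame(
  crash_datetime = as.POSIXct(
    c("2021-01-04 08:15:00", "2021-03-20 17:40:00",
      "2021-07-11 23:05:00", "2021-10-02 12:30:00"),
    tz = "UTC"
  ),
  vehicle_type_1 = c("Sedan", "Taxi", NA, "Bus"),
  vehicle_type_2 = c("Sedan", "Sedan", "Bike", "Sedan"),
  vehicle_type_3 = c(NA, NA, NA, "Van"),
  stringsAsFactors = FALSE
)

# 1. Drop columns whose share of NA values exceeds a threshold (assumed: 50%).
na_ratio <- colMeans(is.na(crashes))
cleaned <- crashes[, na_ratio <= 0.5, drop = FALSE]

# 2. Keep only rows with no missing values in the remaining columns.
cleaned <- cleaned[complete.cases(cleaned), , drop = FALSE]

# 3. Enrich with calendar fields derived from the timestamp.
cleaned$year        <- as.integer(format(cleaned$crash_datetime, "%Y"))
cleaned$month       <- as.integer(format(cleaned$crash_datetime, "%m"))
cleaned$quarter     <- (cleaned$month - 1L) %/% 3L + 1L
cleaned$hour        <- as.integer(format(cleaned$crash_datetime, "%H"))
cleaned$weekday_num <- as.integer(format(cleaned$crash_datetime, "%u"))  # 1 = Monday

print(cleaned)
```

Filtering columns before rows matters here: dropping the sparse columns first avoids discarding every row that simply had no third vehicle.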
Weather.xlsx is a smaller dataset and its cleaning
process is simpler: we only need to convert the time column to a proper
date-time format and rename some columns for clarity. Finally, the
cleaned vehicle data is merged with the weather data to create a comprehensive
dataset for analysis.
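As a rough sketch of the weather cleaning and the merge, the example below converts a text timestamp, renames a column, and joins each crash to the weather record of the same hour. The hourly join key is an assumption (the report does not state how the two tables are matched), and all column names and values are hypothetical.

```r
# Toy crash and weather tables; names and values are illustrative.
crashes <- data.frame(
  crash_datetime = as.POSIXct(
    c("2021-05-01 08:15:00", "2021-05-01 08:50:00", "2021-05-01 14:05:00"),
    tz = "UTC"
  ),
  vehicle_type_1 = c("Sedan", "Taxi", "Bus"),
  stringsAsFactors = FALSE
)
weather <- data.frame(
  time   = c("2021-05-01T08:00", "2021-05-01T09:00", "2021-05-01T14:00"),
  precip = c(0.0, 1.2, 5.5),
  stringsAsFactors = FALSE
)

# Convert the text timestamp to a date-time and give columns clearer names.
weather$time <- as.POSIXct(weather$time, format = "%Y-%m-%dT%H:%M", tz = "UTC")
names(weather)[names(weather) == "time"]   <- "weather_time"
names(weather)[names(weather) == "precip"] <- "rain_mm"

# Build a shared hourly key (assumption: weather is recorded once per hour).
crashes$hour_key <- format(crashes$crash_datetime, "%Y-%m-%d %H")
weather$hour_key <- format(weather$weather_time, "%Y-%m-%d %H")

# Inner join: keep crashes that have a matching weather record.
merged <- merge(crashes, weather, by = "hour_key")
print(merged)
```

Using a formatted character key keeps the join independent of how date-time objects are compared, at the cost of an extra column that can be dropped afterwards.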
At this point we have two clean datasets, one with the vehicle crashes and the other with the weather data. Cleaning reduces the vehicle crash data to about half of its original rows. We will merge the two into a single dataset to perform the analysis.
The result is:
| Name | Rows | Columns | Each row is a |
|---|---|---|---|
| Merged dataset | 1M | 40 | Combination of vehicle and weather data |
After merging the data, we save each year to a separate file to make it easier to analyze the data by year. We will run the analyses both on the full dataset and on individual years.
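One possible way to write one file per year is to split the merged data on a year column and loop over the pieces. The sketch below assumes a `year` column and uses a hypothetical file naming scheme (`crashes_<year>.csv`) in a temporary folder.

```r
# Toy merged dataset; the `year` column and values are illustrative.
merged <- data.frame(
  year    = c(2016L, 2016L, 2021L, 2022L),
  rain_mm = c(0, 1.5, 0, 3.2)
)

# Output folder (a temporary directory here, so the example leaves no files behind).
out_dir <- file.path(tempdir(), "crashes_by_year")
dir.create(out_dir, showWarnings = FALSE)

# Split into one data frame per year and write each to its own CSV file.
by_year <- split(merged, merged$year)
paths <- vapply(names(by_year), function(y) {
  path <- file.path(out_dir, paste0("crashes_", y, ".csv"))
  write.csv(by_year[[y]], path, row.names = FALSE)
  path
}, character(1))

print(paths)
```

A single year can then be loaded on its own, for example with `read.csv(paths[["2021"]])`, without reading the full dataset into memory.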
In this graph we can see the total number of accidents per year in New York from 2016 to 2022. The number of accidents seems to be decreasing over the years, which is a positive trend.
It is important to note that we cannot draw conclusions from this graph alone, as other factors may influence the number of accidents. In particular, the COVID-19 pandemic and the lockdowns of 2020 and 2021 could have reduced the number of vehicles on the road and therefore the number of accidents.
## [1] "Correlation coefficient: 0.59"
This graph shows the correlation between the total number of
accidents and the total rainfall per month in New York in 2021. The
result of 0.59 indicates a moderate positive correlation
between the two variables: months with more rainfall tend to have
more accidents. Correlation alone does not establish that rain causes them.
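For reference, a coefficient like this can be computed with the Pearson correlation on monthly totals. The sketch below uses made-up monthly figures, so it will not reproduce the 0.59 reported above; the column names are assumptions as well.

```r
# Made-up monthly totals for illustration only (not the report's data).
monthly <- data.frame(
  month         = 1:12,
  total_rain_mm = c(60, 85, 90, 70, 110, 95, 120, 130, 100, 105, 65, 80),
  accidents     = c(7000, 7400, 8100, 8300, 9000, 9400,
                    9100, 9300, 9200, 9000, 8200, 7900)
)

# Pearson correlation between monthly rainfall and monthly accident counts.
r <- cor(monthly$total_rain_mm, monthly$accidents, method = "pearson")
print(paste("Correlation coefficient:", round(r, 2)))

# With only 12 monthly points, a significance test helps judge the estimate.
test <- cor.test(monthly$total_rain_mm, monthly$accidents)
print(test$p.value)
```

Because a single year yields only twelve observations, the confidence interval around the coefficient is wide, which is one more reason to read the value cautiously.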
Monthly Number of Accidents by Rainfall Category with Monthly Rainfall
This graph shows the monthly number of accidents in New York in 2021, categorized by rainfall intensity. The black line represents the total number of accidents per month, while the blue bars represent the total rainfall per month on a secondary y-axis. The dots represent the number of accidents in each rainfall category.
The number of accidents tends to increase with higher rainfall, especially in the “1 mm - 4 mm (Light rain)” and “>4 mm - 7 mm (Moderate rain)” categories.
At the same time, the majority of accidents occur in no-rain conditions, which could simply reflect the fact that most of the time it is not raining in New York. Hence, without the total number of vehicles on the road, it is difficult to draw conclusions from this data alone.
Esquisse is an R package for exploring data and building plots interactively. It is similar to Tableau in that it provides a user-friendly, drag-and-drop interface for creating visualizations without writing code.
To use Esquisse, install the package and load it in
your R script. After loading the package, you can launch the app by
calling the esquisser() function with your data as an
argument. That opens the Esquisse interface in a web browser,
where you can create plots interactively by dragging and dropping
variables. Esquisse can also work with the plotly package to make
the plots interactive.
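The launch steps above could look roughly like this. The wrapper `launch_esquisse()` is a hypothetical helper added only so the example does nothing harmful outside an interactive session; `iris` stands in for the merged crash data.

```r
# install.packages("esquisse")  # run once if the package is not installed

# Hypothetical helper: opens the Esquisse interface only in an interactive
# session where the package is available; otherwise it just reports why not.
launch_esquisse <- function(data) {
  stopifnot(is.data.frame(data))
  if (!interactive() || !requireNamespace("esquisse", quietly = TRUE)) {
    message("esquisse needs an interactive R session with the package installed.")
    return(invisible(FALSE))
  }
  esquisse::esquisser(data)  # blocks until the app is closed
  invisible(TRUE)
}

# Example call with a built-in dataset as a placeholder for the crash data.
launch_esquisse(iris)
```

In a regular interactive session the shorter form, `library(esquisse)` followed by `esquisser(my_data)`, is enough.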
At the bottom left of the pane, in the options tab, you can choose to
render the plots with plotly to make them interactive. Once this option is
active, you can hover over the plots to see the data points and values,
or click on the legend to filter the data.